Motivation: Stories from Hongyan Li

I still remember the blizzard in January 27, 2015, we were out of heating because the extreme cold froze the critical part of my furnace, and no one could come to repair in heavy snow. Finally, it turned out that winter was recorded as the worst winter in the history of Boston! Not only the winter, we were also shocked by the recent hot summers. I always felt that summer was a comfortable season in Boston before, but we were surprised to have the warmest August on record in 2018 and 2016. As we experienced extreme weathers in the past several years, we would like to know the trend of weather conditions in Boston and some other cities in US in recent years.

Several years ago, my Mom came to US in winter and our whole family planned to go to Grand Canyon and California for a vacation. We were so excited and planned this trip since my mother got the visa months ago. We first arrived in New York for transfer, but we all were stuck in JFK airport because of the heavy snow, the flight was delay for more than 10 hours, and we were extremely tired, especially my daughter. The bad luck didn’t end after we boarded on the plane. It started to snow in the afternoon when we were leaving Grand Canyon. Driving downhill on the newly falling slippery snow in the dark was really a challenge even though I had been driving in Minnesota and Wisconsin for years. My little daughter even got a high fever, and we had to cancel all the plans afterwards. I always remember this frustrated experience, so we would like to make some recommendations to help visitors to choose best times to visit certain place or best places to visit in certain month.

Data Sources

Data was downloaded from Kaggle. The dataset contains around 5 years of high temporal resolution (hourly measurements) data of various weather attributes, including temperature, humidity, wind speed, and weather description, from 30 US & Canadian Cities and 6 Israeli Cities.

Here, we only analyzed the following 10 US coastal cities, which are well-known tourism cities. (west coast: Seattle, Portland, San Francisco, Los Angeles, and San Diego; east coast: Boston, New York, Philadelphia, Jacksonville, Miami)

Methodology and Data Analysis

import data

library(readr)

temp <- read_csv("/Users/hongyanli/Desktop/BST260project/weather_data/temperature.csv")
hum <- read_csv("/Users/hongyanli/Desktop/BST260project/weather_data/humidity.csv")
wind <- read_csv("/Users/hongyanli/Desktop/BST260project/weather_data/wind_speed.csv")
des <- read_csv("/Users/hongyanli/Desktop/BST260project/weather_data/weather_description.csv")

extract year, month, date, hour; select data from 2012-10-01 to 2017-09-30

library(dplyr)
library(lubridate)

temp=temp %>% 
  mutate(datetime=ymd_hm(datetime)) %>% 
  mutate(year=year(datetime), month=month(datetime), day=day(datetime), hour=hour(datetime)) %>%
  filter(!(year==2017 & month > 9))

hum=hum %>% 
  mutate(datetime=ymd_hm(datetime)) %>% 
  mutate(year=year(datetime), month=month(datetime), day=day(datetime), hour=hour(datetime)) %>%
  filter(!(year==2017 & month > 9))

wind=wind %>% 
  mutate(datetime=ymd_hm(datetime)) %>% 
  mutate(year=year(datetime), month=month(datetime), day=day(datetime), hour=hour(datetime)) %>%
  filter(!(year==2017 & month > 9))

des=des %>% 
  mutate(datetime=ymd_hm(datetime)) %>% 
  mutate(year=year(datetime), month=month(datetime), day=day(datetime), hour=hour(datetime)) %>%
  filter(!(year==2017 & month > 9))

# convert temperature from Kelvin to Celsius
datetime=temp[,1]
temp=cbind(datetime, temp[,c(2:28)]-273.15, temp[,c(29:32)])

temp$term[temp$year==2012 | (temp$year==2013 & temp$month<=9)]="2012.10-2013.09"
temp$term[(temp$year==2013 & temp$month>=10) | (temp$year==2014 & temp$month<=9)]="2013.10-2014.09"
temp$term[(temp$year==2014 & temp$month>=10) | (temp$year==2015 & temp$month<=9)]="2014.10-2015.09"
temp$term[(temp$year==2015 & temp$month>=10) | (temp$year==2016 & temp$month<=9)]="2015.10-2016.09"
temp$term[(temp$year==2016 & temp$month>=10) | (temp$year==2017 & temp$month<=9)]="2016.10-2017.09"

temp$term2[temp$term=="2012.10-2013.09"]="1"
temp$term2[temp$term=="2013.10-2014.09"]="2"
temp$term2[temp$term=="2014.10-2015.09"]="3"
temp$term2[temp$term=="2015.10-2016.09"]="4"
temp$term2[temp$term=="2016.10-2017.09"]="5"

des$term[des$year==2012 | (des$year==2013 & des$month<=9)]="2012.10-2013.09"
des$term[(des$year==2013 & des$month>=10) | (des$year==2014 & des$month<=9)]="2013.10-2014.09"
des$term[(des$year==2014 & des$month>=10) | (des$year==2015 & des$month<=9)]="2014.10-2015.09"
des$term[(des$year==2015 & des$month>=10) | (des$year==2016 & des$month<=9)]="2015.10-2016.09"
des$term[(des$year==2016 & des$month>=10) | (des$year==2017 & des$month<=9)]="2016.10-2017.09"

des$term2[des$term=="2012.10-2013.09"]="1"
des$term2[des$term=="2013.10-2014.09"]="2"
des$term2[des$term=="2014.10-2015.09"]="3"
des$term2[des$term=="2015.10-2016.09"]="4"
des$term2[des$term=="2016.10-2017.09"]="5"

hum$term[hum$year==2012 | (hum$year==2013 & hum$month<=9)]="2012.10-2013.09"
hum$term[(hum$year==2013 & hum$month>=10) | (hum$year==2014 & hum$month<=9)]="2013.10-2014.09"
hum$term[(hum$year==2014 & hum$month>=10) | (hum$year==2015 & hum$month<=9)]="2014.10-2015.09"
hum$term[(hum$year==2015 & hum$month>=10) | (hum$year==2016 & hum$month<=9)]="2015.10-2016.09"
hum$term[(hum$year==2016 & hum$month>=10) | (hum$year==2017 & hum$month<=9)]="2016.10-2017.09"

hum$term2[hum$term=="2012.10-2013.09"]="1"
hum$term2[hum$term=="2013.10-2014.09"]="2"
hum$term2[hum$term=="2014.10-2015.09"]="3"
hum$term2[hum$term=="2015.10-2016.09"]="4"
hum$term2[hum$term=="2016.10-2017.09"]="5"

wind$term[wind$year==2012 | (wind$year==2013 & wind$month<=9)]="2012.10-2013.09"
wind$term[(wind$year==2013 & wind$month>=10) | (wind$year==2014 & wind$month<=9)]="2013.10-2014.09"
wind$term[(wind$year==2014 & wind$month>=10) | (wind$year==2015 & wind$month<=9)]="2014.10-2015.09"
wind$term[(wind$year==2015 & wind$month>=10) | (wind$year==2016 & wind$month<=9)]="2015.10-2016.09"
wind$term[(wind$year==2016 & wind$month>=10) | (wind$year==2017 & wind$month<=9)]="2016.10-2017.09"

wind$term2[wind$term=="2012.10-2013.09"]="1"
wind$term2[wind$term=="2013.10-2014.09"]="2"
wind$term2[wind$term=="2014.10-2015.09"]="3"
wind$term2[wind$term=="2015.10-2016.09"]="4"
wind$term2[wind$term=="2016.10-2017.09"]="5"

select the 10 US cities

temp=temp %>% 
  mutate(date=ymd(paste(year, month, day, sep="-"))) %>% 
  select(date, term, term2, year, month, day, hour, Seattle, Portland, `San Francisco`, `Los Angeles`, `San Diego`, Boston, `New York`, Philadelphia, Jacksonville, Miami)

des=des %>% 
  mutate(date=ymd(paste(year, month, day, sep="-"))) %>% 
  select(date, term, term2, year, month, day, hour, Seattle, Portland, `San Francisco`, `Los Angeles`, `San Diego`, Boston, `New York`, Philadelphia, Jacksonville, Miami)

hum=hum %>% 
  mutate(date=ymd(paste(year, month, day, sep="-"))) %>% 
  select(date, term, term2, year, month, day, hour, Seattle, Portland, `San Francisco`, `Los Angeles`, `San Diego`, Boston, `New York`, Philadelphia, Jacksonville, Miami)

wind=wind %>% 
  mutate(date=ymd(paste(year, month, day, sep="-"))) %>% 
  select(date, term, term2, year, month, day, hour, Seattle, Portland, `San Francisco`, `Los Angeles`, `San Diego`, Boston, `New York`, Philadelphia, Jacksonville, Miami)

calculate heat index

if (!require("weathermetrics")) {
  install.packages("weathermetrics")
  library("weathermetrics")
}

temp=as.data.frame(temp)
hum=as.data.frame(hum)
heat=temp

for(i in c(8:17)){
  heat[,i]=heat.index(t=temp[,i], rh=hum[,i], temperature.metric = "celsius", output.metric = "celsius", round = 1)
}

calculate wind chill

if (!require("ThermIndex")) {
  install.packages("ThermIndex")
  library("ThermIndex")
}

wind=as.data.frame(wind)
chill=temp

for(i in c(8:17)){
  chill[,i]=wc(temp[,i], wind[,i])
}

calculate apparent temperature

Apparent temperature, is derived from either a combination of temperature and wind wind chill or temperature and humidity heat index for the indicated hour. When the temperature is lower than 10\(^0C\) (50\(^0F\)) and the wind speed is higher than 1.34 m/s (3 mph), wind chill was used for that point for the Apparent Temperature. When the temperature is above 26.7\(^0C\) (80\(^0F\)), the heat index will be used for apparent temperature. Otherwise, the apparent temperature will be the ambient air temperature.

app=temp

for(i in c(1:43812)){
  for (j in c(8:17)){
    app[i,j]=ifelse(is.na(app[i,j]), NA, ifelse(app[i,j]>26.7, heat[i,j], ifelse(app[i,j]<10 & wind[i,j]>1.34, chill[i,j], temp[i,j])))
  }
}

check NA proportion

a=heat %>% summarize_all(funs(sum(is.na(.)) / length(.)))
a=data.frame(t(a))
colnames(a)="heat_NA"

b=chill %>% summarize_all(funs(sum(is.na(.)) / length(.)))
b=data.frame(t(b))
colnames(b)="chill_NA"

c=temp %>% summarize_all(funs(sum(is.na(.)) / length(.)))
c=data.frame(t(c))
colnames(c)="temp_NA"

d=des %>% summarize_all(funs(sum(is.na(.)) / length(.)))
d=data.frame(t(d))
colnames(d)="des_NA"

e=app %>% summarize_all(funs(sum(is.na(.)) / length(.)))
e=data.frame(t(e))
colnames(e)="app_NA"

f=hum %>% summarize_all(funs(sum(is.na(.)) / length(.)))
f=data.frame(t(f))
colnames(f)="hum_NA"

g=wind %>% summarize_all(funs(sum(is.na(.)) / length(.)))
g=data.frame(t(g))
colnames(g)="wind_NA"

NA_prop=cbind(a,b,c,d,e,f,g)*100
NA_prop=NA_prop[c(8:17),]

Result (1)

plot daily highest, lowest, average temperature

library(tidyverse)
library(ggplot2)
library(gridExtra)

if (!require("grid")) {
  install.packages("grid")
  library("grid")
}

plot=list()
for(i in c(8:17)){
  b=names(temp[i])
  a=cbind(temp[,1:7], temp=temp[,i])
  names(a)[8]="temp"
  
  aa=a %>% 
    group_by(year, month, day) %>% 
    summarize(highest=max(temp, na.rm=TRUE), average=mean(temp, na.rm=TRUE), lowest=min(temp, na.rm=TRUE))
  
  aaa=left_join(aa,(a %>% select(year, month, day, term, term2, date)),by=c("year", "month", "day"))
  
  aaa=aaa %>% 
    filter(!duplicated(date)) %>% 
    gather(type, temp, c(`highest`, `average`, `lowest`))
 
  trendhigh=lm(temp~time(temp), data=aaa[aaa$type=="highest",])
  trendavg=lm(temp~time(temp), data=aaa[aaa$type=="average",])
  trendlow=lm(temp~time(temp), data=aaa[aaa$type=="lowest",])
  aaa$trend=c(fitted(trendhigh), fitted(trendavg), fitted(trendlow))
  
  p=ggplot(aaa, aes(date, temp, color=term)) +
    geom_point(alpha=1) +
    geom_line(aes(date, trend), color="black") +
    scale_y_continuous("daily temperature (??)", breaks=c(-20,-10,0,10,20,30,40)) +
    xlab("") +
    ggtitle(b)+
    facet_grid(. ~ type)  +
    theme(title = element_text(colour="black",size=13),
          axis.text = element_text(colour="grey20",size=11),
          axis.title = element_text(colour="black",size=12),
          legend.text = element_text(colour="black",size=12),
          legend.title = element_text(colour="black",size=12),
          strip.text.x = element_text(colour="black",size=12)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(legend.position = "none")
    
    plot[[i-7]]=p
}

grid.arrange(plot[[1]], plot[[2]], plot[[3]], plot[[4]], plot[[5]], plot[[6]], plot[[7]], plot[[8]], plot[[9]], plot[[10]], ncol=2, nrow=5)

We organized all the temperatures in average, highest, and lowest of each day in the past fives years. The scatterplots for hte 10 cities are shown below, and we could visually see that the temperature are roughly following an increasing trend over the past five years, especially in west coast cities like Portland, San Francisco, Los Angeles, and San Diego, and east coast cities such as Jacksonville and Miami.

Result (2)

plot difference between apparent temperature and air temperature

if (!require("zoo")) {
  install.packages("zoo")
  library("zoo")
}

plot=list()

grid_arrange_shared_legend <- function(..., ncol = length(list(...)), nrow = 1, position = c("bottom", "right")) {
    plots <- list(...)
    position <- match.arg(position)
    g <- ggplotGrob(plots[[1]] + 
    theme(legend.position = position))$grobs
    legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
    lheight <- sum(legend$height)
    lwidth <- sum(legend$width)
    gl <- lapply(plots, function(x) x +
    theme(legend.position = "none"))
    gl <- c(gl, ncol = ncol, nrow = nrow)

    combined <- switch(position,
                       "bottom" = arrangeGrob(do.call(arrangeGrob, gl), 
                       legend,ncol = 1,
                    heights = unit.c(unit(1, "npc") - lheight, lheight)),
                    "right" = arrangeGrob(do.call(arrangeGrob, gl),
                  legend, ncol = 2,
                    widths = unit.c(unit(1, "npc") - lwidth, lwidth)))

    grid.newpage()
    grid.draw(combined)

    invisible(combined)
}

for(i in c(8:17)){
  cityname=names(temp[i])
  temp_city=cbind(temp[,1:7], city=temp[,i])
  names(temp_city)[8]="temp"
  app_city=cbind(app[,1:7], city=app[,i])
  names(app_city)[8]="app"

temp_city <- temp_city %>% 
  group_by(year, month, day) %>% 
  summarise(highest=max(temp, na.rm=TRUE), lowest=min(temp, na.rm=TRUE)) %>% 
  ungroup() %>% 
  group_by(year, month) %>%
  summarise(highest_air=mean(highest), lowest_air=mean(lowest)) %>% 
  ungroup()
  
app_city <- app_city %>% 
  group_by(year, month, day) %>% 
  summarise(highest=max(app, na.rm=TRUE), lowest=min(app, na.rm=TRUE)) %>% 
  ungroup() %>% 
  group_by(year, month) %>%
  summarise(highest_apparent=mean(highest), lowest_apparent=mean(lowest)) %>% 
  ungroup() %>%
  left_join(temp_city, by=c("year","month"))

app_city=app_city %>% gather(type, temperature, -month, -year)  
mon=paste(app_city$year, app_city$month, sep="-")
mon=as.yearmon(mon)
app_city=cbind(app_city, mon=mon)

p <-  app_city %>%
  ggplot(aes(mon, temperature, col=type, group=type)) + 
  geom_line(size=0.8) +
  xlab("") +
  scale_y_continuous("daily temperature (??)", breaks=c(-20,-10,0,10,20,30,40)) +
  ggtitle(cityname) +
  theme(title = element_text(colour="black",size=15),
          axis.text = element_text(colour="grey20",size=13),
          axis.title = element_text(colour="black",size=14),
          legend.text = element_text(colour="black",size=15),
          legend.title = element_text(colour="black",size=15)) +
  theme(plot.title = element_text(hjust = 0.5))
  plot[[i-7]]=p
}

grid_arrange_shared_legend(plot[[1]], plot[[2]], plot[[3]], plot[[4]], plot[[5]], plot[[6]], plot[[7]], plot[[8]], plot[[9]], plot[[10]], ncol=3, nrow=4, position="right")

We plotted the daily highest and lowest air temperature comparing to the corresponding apparent temperatures of the 10 US cities in the past five years. A clear finding here is that the difference between the daily highest apparent temperature and air temperature is becoming larger in summer in the recent three years in San Francisco, Los Angeles, and Boston, compared to the earlier 2 years. This means people in these cities were feeling hotter than before.

Result (3)

extract weather description

term=list()

for(i in c(8:17)){
  term[[i-7]]=levels(as.factor(as.data.frame(des)[,i]))
}
term=unlist(term)
term=as.data.frame(term)
names(term)=c("term")
term=term %>% 
  filter(!duplicated(term))
term
##                                   term
## 1                        broken clouds
## 2                              drizzle
## 3                           few clouds
## 4                                  fog
## 5                                 haze
## 6              heavy intensity drizzle
## 7                 heavy intensity rain
## 8          heavy intensity shower rain
## 9                           heavy snow
## 10             light intensity drizzle
## 11         light intensity shower rain
## 12                          light rain
## 13                   light shower snow
## 14                          light snow
## 15                                mist
## 16                       moderate rain
## 17                     overcast clouds
## 18              proximity thunderstorm
## 19                    scattered clouds
## 20                         shower rain
## 21                        sky is clear
## 22                               smoke
## 23                                snow
## 24                             squalls
## 25                        thunderstorm
## 26        thunderstorm with heavy rain
## 27        thunderstorm with light rain
## 28              thunderstorm with rain
## 29                     very heavy rain
## 30                                dust
## 31                       freezing rain
## 32                               sleet
## 33               proximity shower rain
## 34    proximity thunderstorm with rain
## 35                 light rain and snow
## 36                  heavy thunderstorm
## 37 proximity thunderstorm with drizzle
## 38                                sand
## 39                    sand/dust whirls
## 40     thunderstorm with light drizzle
## 41                        volcanic ash
## 42                             tornado

plot weather description each year

plot=list()

color=c("sunny" ='#fb61d7',"cloudy" ='#00b6eb', "fog" ='#53b400',"haze" ='#c49a00', "rainy" ='#f8766d',"snow" ='#00c094', "storm" ='#a58aff')

for(i in c(8:17)){
  b=names(des[i])
  a=cbind(des[,1:7], d=des[,i])
  names(a)[8]="d"
  a=a %>% 
    filter(!is.na(d)) %>% 
    mutate(des=ifelse(d %in% c("broken clouds", "scattered clouds", "overcast clouds"), "cloudy", ifelse(d %in% c("fog", "mist"), "fog", ifelse(d %in% c("dust", "haze", "sand", "smoke", "sand/dust whirls", "volcanic ash"), "haze", ifelse(d %in% c("heavy snow", "snow", "light snow", "light shower snow"), "snow", ifelse(d %in% c("squalls", "thunderstorm", "proximity thunderstorm", "heavy thunderstorm", "tornado"), "storm", ifelse(d %in% c("few clouds", "sky is clear"), "sunny", "rainy"))))))) %>% 
    group_by(term, des) %>% 
    mutate(count=n()) %>% 
    filter(!duplicated(count)) %>%
    ungroup()
  
  p=a %>%
    mutate(des = reorder(des, count, FUN=mean)) %>%
    ggplot(aes(fill=des, y=count, x=term)) +
    geom_bar(stat="identity", position="fill", width=0.5) +
    ggtitle(b) +
    scale_y_continuous(labels = scales::percent) + 
    xlab("") + 
    ylab("Proportion") +
    theme(title = element_text(colour="black",size=15),
          axis.text = element_text(colour="grey20",size=13),
          axis.title = element_text(colour="black",size=14),
          legend.text = element_text(colour="black",size=15),
          legend.title = element_text(colour="black",size=15)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 15, hjust = 0.5, vjust = 0.6)) +
    scale_fill_manual(values=color) 
    
    plot[[i-7]]=p
}

grid_arrange_shared_legend(plot[[1]], plot[[2]], plot[[3]], plot[[4]], plot[[5]], plot[[6]], plot[[7]], plot[[8]], plot[[9]], plot[[10]], ncol=3, nrow=4, position="right")

We plotted the daily highest and lowest air temperature comparing to the corresponding apparent temperatures of the 10 US cities in the past five years. A clear finding here is that the difference between the daily highest apparent temperature and air temperature is becoming larger in summer in the recent three years in San Francisco, Los Angeles, and Boston, compared to the earlier 2 years. This means people in these cities were feeling hotter than before.

Result (4)

average number of nice days in a year (comfortable for living)

The criteria for nice days are referenced form here

  1. High temperature between 65 and 85\(^0F\) (18.33-29.44\(^0C\))

  2. Maximum dew point temperature less than or equal to 65\(^0F\) (18.33\(^0C\))

  3. Peak daily wind (including gusts) less than 25 mph (11.176 m/s)

  4. Average daily cloud cover less than or equal to 65 percent

  5. No measurable precipitation

Dew point temperature is calculated using temperature and humidity. Cloud coverage is assigned to each description referencing from here.

dew=hum

for(i in c(1:43812)){
  for (j in c(8:17)){
    dew[i,j]=ifelse(is.na(hum[i,j]), NA, humidity.to.dewpoint(rh=hum[i,j],t=temp[i,j], temperature.metric = "celsius"))
  }
}
highestfun=function(x){
    a=max(x, na.rm=TRUE)
    a
}
cc=function(x){
  c=ifelse(is.na(x), NA, ifelse(x=="broken clouds", 75, ifelse(x=="scattered clouds", 44, ifelse(x=="few clouds", 12.5, ifelse(x=="sky is clear", 0, 100)))))
  c
}
averagefun=function(x){
  a=mean(x, na.rm=TRUE)
  a
}
rain=function(x){
  a=ifelse(is.na(x), NA, ifelse(x %in% c("broken clouds", "scattered clouds", "overcast clouds", "fog", "mist", "dust", "haze", "sand", "smoke", "sand/dust whirls", "volcanic ash", "squalls", "thunderstorm", "proximity thunderstorm", "heavy thunderstorm", "tornado", "few clouds", "sky is clear"),0,1))
  a
}
sumfun=function(x){
  a=sum(x, na.rm=TRUE)
  a
}
nice_t=temp[,c(4:6, 8:17)] %>% 
  group_by(year, month, day) %>% 
  summarise_all(funs(high=highestfun))

nice_d=dew[,c(4:6, 8:17)] %>% 
  group_by(year, month, day) %>% 
  summarise_all(funs(high=highestfun))

nice_w=wind[,c(4:6, 8:17)] %>% 
  group_by(year, month, day) %>% 
  summarise_all(funs(high=highestfun))

cloud=des[,c(4:17)] %>% 
  group_by(year, month, day, hour) %>% 
  summarise_all(funs(c=cc))

nice_c=cloud[,c(1:3, 5:14)] %>% 
  group_by(year, month, day) %>% 
  summarise_all(funs(avg=averagefun))

rain=des[,c(4:17)] %>% 
  group_by(year, month, day, hour) %>% 
  summarise_all(funs(r=rain))

nice_r=rain[,c(1:3, 5:14)] %>% 
  group_by(year, month, day) %>% 
  summarise_all(funs(sum=sumfun))

niceday=nice_r

for(i in 1:1826){
  for(j in 4:13){
    niceday[i,j]=ifelse((nice_t[i,j]<29.44 & nice_t[i,j]>18.33 & nice_d[i,j]<=18.33 & nice_r[i,j]==0 & nice_w[i,j]<11.176 & nice_c[i,j]<=65),1,0)
  }
}

niceday$term[niceday$year==2012 | (niceday$year==2013 & niceday$month<=9)]="2012.10-2013.09"
niceday$term[(niceday$year==2013 & niceday$month>=10) | (niceday$year==2014 & niceday$month<=9)]="2013.10-2014.09"
niceday$term[(niceday$year==2014 & niceday$month>=10) | (niceday$year==2015 & niceday$month<=9)]="2014.10-2015.09"
niceday$term[(niceday$year==2015 & niceday$month>=10) | (niceday$year==2016 & niceday$month<=9)]="2015.10-2016.09"
niceday$term[(niceday$year==2016 & niceday$month>=10) | (niceday$year==2017 & niceday$month<=9)]="2016.10-2017.09"

nicedayresult=niceday %>% group_by(term) %>% summarise_all(funs(sum=sumfun))
nicedayresult=nicedayresult[,5:14]
names(nicedayresult)=c("Seattle", "Portland", "San Francisco", "Los Angeles", "San Diego", "Miami", "Jacksonville", "Philadelphia", "New York", "Boston")
nd=data.frame(apply(nicedayresult,2,mean))
nd$city=row.names(nd)
names(nd)=c("niceday", "city")

p=ggplot(nd, aes(x=reorder(city, niceday), y=niceday)) +
  geom_bar(stat='identity', colour="dodgerblue", fill="dodgerblue", width=0.6) +
  geom_text(aes(label=niceday), position=position_dodge(width=0.6), hjust=-0.3, vjust=0.3, color="grey20", size=4) +
  coord_flip() +
  xlab("") +
  ylab("") +
  scale_y_continuous(limits=c(0, 150)) +
  ggtitle("Average Number of Nice Days in a Year") +
  theme(title = element_text(colour="black",size=13),
        axis.text = element_text(colour="grey20",size=12),
        axis.title = element_text(colour="black",size=12),
        legend.text = element_text(colour="black",size=12),
        legend.title = element_text(colour="black",size=12)) +
  theme(plot.title = element_text(hjust = 0.5))
p

Ranked as 1st, Los Angeles has the most nice days in a year among the 10 US cities, and San Diego follows. Miami, which is super hot and humid, has the fewest number of nice days, may not be a good choice for long time living.

Result (5)

average weather score for each month (for travelling)

The weather score takes both apparent temperature and sunny time into consideration, with a scale of 20 (10 for each). A higher weather score means more suitable for traveling.

Temperature Score (referenced from here): For each hour between 8:00 AM and 6:00 PM of each day in the analysis period (2012.10-2017.09), independent scores are computed for apparent temperature. Those hourly scores are then aggregated into days, averaged over all the five years. Our temperature score is 0 for apparent temperatures below 10 °C (50°F), rising linearly to 9 for 18.3°C (65°F), to 10 for 23.9 °C (75°F), falling linearly to 9 for 26.7 °C (80°F), and to 0 for 32.2 °C (90°F) or hotter.

Functions for the temperature score against temeprature:

10-18.3\(^0C\): y=1.0843*x-10.8434

18.3-23.9\(^0C\): y=0.1786*x+5.7316

23.9-26.7\(^0C\): y=18.5347-0.3571*x

26.7-32.2\(^0C\): y=52.6921-1.6364*x

# temperature score
score_app=app %>% 
  filter(hour >=8 & hour <=18) %>% 
  gather(city, app, c(`Seattle`, `Portland`, `San Francisco`, `Los Angeles`, `San Diego`, `Boston`, `New York`, `Philadelphia`, `Jacksonville`, `Miami`)) %>%
  mutate(score=case_when(app < 10 | app >=32.2 ~ 0,
                         app >= 10 & app < 18.3 ~ (1.0843*app-10.8434),
                         app >= 18.3 & app < 23.9 ~ (0.1786*app+5.7316),
                         app >=23.9 & app < 26.7 ~ (18.5347-0.3571*app),
                         app >=26.7 & app < 32.2 ~ (52.6921-1.6364*app),
                         TRUE ~ NA_real_))
score_app=score_app %>%
  group_by(month, city) %>%
  mutate(score_tem=mean(score, na.rm=TRUE)) %>%
  filter(!duplicated(score_tem)) %>%
  select(month, city, score_tem)

Sunny Score (defined by ourselves): The proportion of sunny time between 8:00 AM and 6:00 PM is calculated for each month, and averaged over the past five years. All the sunny time proportions for each city and in each month (10*12=120 120 proportions) were ranked from the lowest to the highest (the rank of the highest proportion is 120, and the lowest is 1). Then our sunny score for a specific city and specific month is its rank divided by 12 (highest proportion has a score of 10, and the lowest 0.1).

# sunny score  
score_des=des %>% 
  filter(hour >=8 & hour <=18) %>% 
  gather(city, des, c(`Seattle`, `Portland`, `San Francisco`, `Los Angeles`, `San Diego`, `Boston`, `New York`, `Philadelphia`, `Jacksonville`, `Miami`)) %>% 
  filter(!is.na(des)) %>% 
  mutate(wea=ifelse(des %in% c("broken clouds", "scattered clouds", "overcast clouds"), "cloudy", ifelse(des %in% c("fog", "mist"), "fog", ifelse(des %in% c("dust", "haze", "sand", "smoke", "sand/dust whirls", "volcanic ash"), "haze", ifelse(des %in% c("heavy snow", "snow", "light snow", "light shower snow"), "snow", ifelse(des %in% c("squalls", "thunderstorm", "proximity thunderstorm", "heavy thunderstorm", "tornado"), "storm", ifelse(des %in% c("few clouds", "sky is clear"), "sunny", "rainy"))))))) %>% 
  group_by(year, month, city, wea) %>% 
  mutate(n=n()) %>% 
  filter(!duplicated(wea)) %>% 
  select(year, month, city, wea, n) %>% 
  group_by(year, month, city) %>% 
  mutate(prop=prop.table(n))

sunny_prop=score_des %>%
  filter(wea=="sunny") %>% 
  ungroup() %>%
  select(month, city, prop) %>%
  group_by(month, city) %>%
  mutate(prop=mean(prop)) %>% 
  filter(!duplicated(prop)) %>%
  ungroup() %>%
  mutate(score_sun=rank(prop)/12)

Weather Score = Temperature Score + Sunny Score

# total weather score
weather_score=left_join(score_app, sunny_prop, by=c("month", "city")) %>%
  select(month, city, score_tem, score_sun) %>%
  mutate(score_t=score_tem+score_sun) %>%
  ungroup()

Conclusion

From all the analyses, we found that in the past five years (from October 2012 to September 2017), the temperatures in the 10 chosen US coastal cities (Seattle, Portland, San Francisco, Los Angeles, San Diego, Boston, New York, Philadelphia, Jacksonville, and Miami) roughly followed an upward trend. People in some cities (San Francisco, Los Angeles, and Boston) were feeling hotter during the summer in the recent 3 years than before. While the temperature went up, the sunny time in west coast cities dropped over years, and the fog and haze time increased.

From weather’s perspective, we calculated nice days and weather scores for these cities. Los Angeles turned out to have the most nice days and received the highest weather score, even in December and January. So we highly recommend to visit Los Angeles in this coming Christmas and New Year’s holiday if you haven’t been there before.